SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
Zhao, Zehua, Huang, Zhixian, Li, Junren, Lin, Siyu, Zhou, Junting, Cao, Fengqi, Zhou, Kun, Ge, Rui, Long, Tingting, Zhu, Yuexiang, Liu, Yan, Zheng, Jie, Wei, Junnian, Zhu, Rong, Zou, Peng, Li, Wenyu, Cheng, Zekai, Ding, Tian, Wang, Yaxuan, Yan, Yizhao, Wei, Tingru, Ming, Haowei, Mao, Weijie, Sun, Chen, Liu, Yiming, Wang, Zichen, Zhang, Zuo, Yang, Tong, Ma, Hao, Gao, Zhen, Pei, Jian
Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The benchmark dataset is available at https://huggingface.co/datasets/ZehuaZhao/SUPERChem.
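The RPF idea, crediting a model for following the expert's solution path step by step rather than only for its final answer, can be illustrated with a toy scorer. This is a hypothetical sketch with invented names (`rpf_score`, `keyword_match`), not the benchmark's actual metric:

```python
def rpf_score(model_steps, expert_steps, match):
    """Toy Reasoning Path Fidelity: the fraction of expert solution
    steps that are matched, in order, by some step of the model's
    reasoning trace."""
    matched, i = 0, 0
    for expert in expert_steps:
        while i < len(model_steps):
            hit = match(model_steps[i], expert)
            i += 1
            if hit:
                matched += 1
                break
    return matched / len(expert_steps) if expert_steps else 1.0

def keyword_match(model_step, expert_step, thresh=0.5):
    """Crude matcher for the sketch: enough of the expert step's
    words must appear in the model step."""
    wm = set(model_step.lower().split())
    we = set(expert_step.lower().split())
    return len(wm & we) / max(len(we), 1) >= thresh
```

A real process-level evaluation would presumably use an LLM judge or semantic similarity as the step matcher; the ordered-matching skeleton is the point of the sketch.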
Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
Hu, Xiang, Zhou, Zhanchao, Liang, Ruiqi, Li, Zehuan, Wu, Wei, Li, Jianguo
This work explores the challenge of building "Machines that Can Remember", framing long-term memory as the problem of efficient ultra-long context modeling. We argue that this requires three key properties: sparsity, random-access flexibility, and length generalization. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, which is an 8B-parameter MoE model trained on over 8 trillion tokens and is rigorously evaluated on different tasks with in-domain and out-of-domain context lengths to demonstrate its capability in handling ultra-long contexts. Results show that our model performs comparably to full-attention baselines on in-domain lengths while achieving over 90% accuracy on most in-context retrieval tasks with contexts up to 16M. This report outlines our experimental insights and open problems, contributing a foundation for future research in ultra-long context modeling.
Figure 1: Despite being pre-trained with an 8K context window and mid-trained up to 32K, HSA-UltraLong achieves near-perfect accuracy on S-NIAH even at a 16M-token context length. The red dashed line at 32K marks the boundary between in-domain (left) and out-of-domain (right).
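A single-needle retrieval test of the kind summarized in Figure 1 can be constructed in a few lines. The sketch below is only illustrative and the names (`make_s_niah`) are invented; the paper's actual S-NIAH setup may differ:

```python
def make_s_niah(context_tokens, needle, depth=0.5,
                filler="The sky is blue."):
    """Build a single-needle-in-a-haystack prompt: repeat a filler
    sentence until roughly `context_tokens` whitespace tokens, then
    insert `needle` at the given relative depth (0.0 = start)."""
    n_repeats = max(1, context_tokens // len(filler.split()))
    haystack = [filler] * n_repeats
    pos = int(depth * len(haystack))
    haystack.insert(pos, needle)
    prompt = " ".join(haystack)
    question = "What is the magic number mentioned in the text?"
    return prompt, question

# Example: a ~1M-token haystack with the needle at the midpoint.
prompt, question = make_s_niah(1_000_000, "The magic number is 7481.")
```

Sweeping `context_tokens` past the 32K training boundary (up to 16M) and `depth` over [0, 1] reproduces the shape of the evaluation described in the figure caption.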
From Scaling to Structured Expressivity: Rethinking Transformers for CTR Prediction
Yan, Bencheng, Lei, Yuejie, Zeng, Zhiyuan, Wang, Di, Lin, Kaiyi, Wang, Pengjie, Xu, Jian, Zheng, Bo
Despite massive investments in scale, deep models for click-through rate (CTR) prediction often exhibit rapidly diminishing returns, a stark contrast to the smooth, predictable gains seen in large language models. We identify the root cause as a structural misalignment: Transformers assume sequential compositionality, while CTR data demand combinatorial reasoning over high-cardinality semantic fields. Unstructured attention spreads capacity indiscriminately, amplifying noise under extreme sparsity and breaking scalable learning. To restore alignment, we introduce the Field-Aware Transformer (FAT), which embeds field-based interaction priors into attention through decomposed content alignment and cross-field modulation. This design ensures model complexity scales with the number of fields F, not the total vocabulary size n >> F, leading to tighter generalization and, critically, observed power-law scaling in AUC as model width increases. We present the first formal scaling law for CTR models, grounded in Rademacher complexity, that explains and predicts this behavior. On large-scale benchmarks, FAT improves AUC by up to +0.51% over state-of-the-art methods. Deployed online, it delivers +2.33% CTR and +0.66% RPM. Our work establishes that effective scaling in recommendation arises not from size, but from structured expressivity: architectural coherence with data semantics.
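The central design choice, modulating attention logits with a learned F x F cross-field prior so that complexity tracks the number of fields rather than the vocabulary, can be sketched as follows. This is a schematic with invented names (`field_aware_attention`), not FAT's exact decomposition:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def field_aware_attention(X, field_ids, W_q, W_k, W_v, field_bias):
    """Self-attention over feature-field embeddings where attention
    logits get an additive F x F cross-field prior (`field_bias`),
    indexed by (field_i, field_j) rather than by raw feature values,
    so the prior's size grows with the number of fields F, not with
    the total vocabulary size n >> F."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits += field_bias[np.ix_(field_ids, field_ids)]  # cross-field modulation
    return softmax(logits, axis=-1) @ V

rng = np.random.default_rng(0)
F, d = 4, 8                        # 4 feature fields, embedding dim 8
X = rng.normal(size=(F, d))        # one embedding per field occurrence
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))
field_bias = rng.normal(size=(F, F))
out = field_aware_attention(X, np.arange(F), W_q, W_k, W_v, field_bias)
```

Because `field_bias` is indexed by field identity, two rare feature values from the same pair of fields share the same interaction prior, which is the structural constraint the abstract credits for tighter generalization.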
Failure cases of GET. It is worth noting that the Gaussian equivalence property (Theorem 3) may no longer hold if we train the features longer. In particular, because of our mean-field parameterization, the first-layer weight W needs to travel sufficiently far away from initialization to achieve small training loss (see Figure 2). Hence in our experimental simulations (where n, d, N are large but finite), as the number of steps t increases, we expect the Gaussian equivalence predictions to become inaccurate at some point. This transition is empirically demonstrated in Figure 4(a).
The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams
We study the gradient-based training of large-depth residual networks (ResNets) from standard random initializations. We show that with a diverging depth $L$, a fixed embedding dimension $D$, and an arbitrary hidden width $M$, the training dynamics converges to a Neural Mean ODE training dynamics. Remarkably, the limit is independent of the scaling of $M$, covering practical cases of, say, Transformers, where $M$ (the number of hidden units or attention heads per layer) is typically of the order of $D$. For a residual scale $\Theta_D\big(\frac{\alpha}{LM}\big)$, we obtain the error bound $O_D\big(\frac{1}{L} + \frac{\alpha}{\sqrt{LM}}\big)$ between the model's output and its limit after a fixed number of gradient steps, and we verify empirically that this rate is tight. When $\alpha = \Theta(1)$, the limit exhibits complete feature learning, i.e. the Mean ODE is genuinely non-linearly parameterized. In contrast, we show that $\alpha \to \infty$ yields a lazy ODE regime where the Mean ODE is linearly parameterized. We then focus on the particular case of ResNets with two-layer perceptron blocks, for which we study how these scalings depend on the embedding dimension $D$. We show that for this model, the only residual scale that leads to complete feature learning is $\Theta\big(\frac{\sqrt{D}}{LM}\big)$. In this regime, we prove the error bound $O\big(\frac{1}{L} + \frac{\sqrt{D}}{\sqrt{LM}}\big)$ between the ResNet and its limit after a fixed number of gradient steps, which is also empirically tight. Our convergence results rely on a novel mathematical perspective on ResNets: (i) due to the randomness of the initialization, the forward and backward pass through the ResNet behave as the stochastic approximation of certain mean ODEs, and (ii) by propagation of chaos (that is, asymptotic independence of the units) this behavior is preserved through the training dynamics.
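The scaling in question can be made concrete by writing the forward pass of a depth-$L$, width-$M$ ResNet under the residual scale discussed above. This is a schematic paraphrase with notation chosen to mirror the abstract, not the paper's exact formulation:

```latex
% Forward pass with L residual blocks of M hidden units each,
% at residual scale alpha/(LM):
x_{l+1} \;=\; x_l \;+\; \frac{\alpha}{L M} \sum_{m=1}^{M} f\big(x_l;\, w_{l,m}\big),
\qquad l = 0, \dots, L-1.
% As L grows, each layer acts as a noisy (over the M random units)
% Euler step of the mean ODE driven by the distribution \mu_t of unit
% parameters:
\frac{\mathrm{d}x_t}{\mathrm{d}t} \;=\; \alpha\, \mathbb{E}_{w \sim \mu_t}\!\big[ f(x_t; w) \big].
```

Read this way, the two terms of the $O_D\big(\frac{1}{L} + \frac{\alpha}{\sqrt{LM}}\big)$ bound correspond to the discretization error of the Euler step and the Monte Carlo fluctuation of the $M$-unit average, respectively.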
CURVALID: Geometrically-guided Adversarial Prompt Detection
Yung, Canaan, Huang, Hanxun, Erfani, Sarah Monazam, Leckie, Christopher
Adversarial prompts capable of jailbreaking large language models (LLMs) and inducing undesirable behaviors pose a significant obstacle to their safe deployment. Current mitigation strategies rely on activating built-in defense mechanisms or fine-tuning the LLMs, but the fundamental distinctions between adversarial and benign prompts are yet to be understood. In this work, we introduce CurvaLID, a novel defense framework that efficiently detects adversarial prompts by leveraging their geometric properties. It is agnostic to the type of LLM, offering a unified detection framework across diverse adversarial prompts and LLM architectures. CurvaLID builds on the geometric analysis of text prompts to uncover their underlying differences. We theoretically extend the concept of curvature via the Whewell equation into an $n$-dimensional word embedding space, enabling us to quantify local geometric properties, including semantic shifts and curvature in the underlying manifolds. Additionally, we employ Local Intrinsic Dimensionality (LID) to capture geometric features of text prompts within adversarial subspaces. Our findings reveal that adversarial prompts differ fundamentally from benign prompts in terms of their geometric characteristics. Our results demonstrate that CurvaLID delivers superior detection and rejection of adversarial queries, paving the way for safer LLM deployment. The source code can be found at https://github.com/Cancanxxx/CurvaLID
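The geometric signal CurvaLID builds on can be approximated discretely: treat a prompt's word-embedding sequence as a curve and measure the turning angle per unit arc length, a rough analogue of the Whewell relation $\kappa = d\phi/ds$. The sketch below is ours, not the paper's implementation:

```python
import numpy as np

def discrete_curvature(embeddings):
    """Approximate curvature along a sequence of word embeddings:
    the turning angle between consecutive segment vectors, divided
    by the local arc length (discrete analogue of kappa = dphi/ds)."""
    E = np.asarray(embeddings, dtype=float)
    deltas = np.diff(E, axis=0)            # segment vectors between tokens
    lens = np.linalg.norm(deltas, axis=1)  # segment lengths (arc elements)
    curvatures = []
    for i in range(len(deltas) - 1):
        u, v = deltas[i], deltas[i + 1]
        cos = np.dot(u, v) / (lens[i] * lens[i + 1] + 1e-12)
        angle = np.arccos(np.clip(cos, -1.0, 1.0))  # turning angle dphi
        ds = 0.5 * (lens[i] + lens[i + 1])          # local arc length
        curvatures.append(angle / (ds + 1e-12))
    return np.array(curvatures)
```

A collinear embedding path yields near-zero curvature everywhere, while abrupt semantic shifts (of the kind the paper associates with adversarial prompts) produce sharp turning angles; summary statistics of this profile could then feed a detector alongside LID features.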
Applications of Statistical Field Theory in Deep Learning
Ringel, Zohar, Rubin, Noa, Mor, Edo, Helias, Moritz, Seroussi, Inbar
Deep learning algorithms have made incredible strides in the past decade, yet due to the complexity of these algorithms, the science of deep learning remains in its early stages. Since the field is experimentally driven, it is natural to seek a theory of deep learning within the physics paradigm. As deep learning is largely about learning functions and distributions over functions, statistical field theory, a rich and versatile toolbox for tackling complex distributions over functions (fields), is an obvious choice of formalism. Research efforts carried out in the past few years have demonstrated the ability of field theory to provide useful insights on generalization, implicit bias, and feature learning effects. Here we provide a pedagogical review of this emerging line of research.